Project-Team:REALOPT

Inria | Raweb 2016 | Presentation of the Project-Team REALOPT | REALOPT Web Site




Section: New Results

Scheduling and Placement for HPC

As the architecture of HPC nodes grows more complex (multicore processors, non-uniform memory access, GPUs and other accelerators), a recent trend in application development is to express the computations explicitly as a task graph, and to rely on a specialized middleware stack to make scheduling decisions and implement them. The algorithms traditionally used in this community are dynamic heuristics, which cope with the unpredictability of execution times. In [12], we analyze the performance of static and hybrid strategies, obtained by adding more static (resp. dynamic) features to dynamic (resp. static) strategies. Our conclusions are somewhat unexpected: we show that static-based strategies are very efficient, even when performance estimations are not very accurate. We also present and generalize HeteroPrio, a semi-static resource-centric strategy based on the acceleration factors of tasks. In [19], we generalize this strategy to platforms with more than two types of resources, which makes it possible to exploit intra-task parallelism by grouping several CPU cores together. In [27], we prove tight approximation ratios for HeteroPrio in the context of independent tasks, providing theoretical insight into its good practical performance.
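To make the idea of a resource-centric strategy driven by acceleration factors concrete, here is a minimal Python sketch for independent tasks on CPUs and GPUs. It is an illustration under stated assumptions, not the algorithm as specified in [12], [19] or [27]: we assume the acceleration factor of a task is its CPU time divided by its GPU time, that an idle GPU picks the remaining task with the largest factor, and that an idle CPU picks the one with the smallest factor. The function name `heteroprio_schedule`, the task tuples and the tie-breaking rule are hypothetical, and any refinement used in the cited papers (for instance moving a task that has already started to another resource) is left out.

```python
import heapq

def heteroprio_schedule(tasks, n_cpus, n_gpus):
    """Simplified acceleration-factor list scheduling (illustrative only).

    tasks: list of (name, cpu_time, gpu_time) for independent tasks.
    Returns (makespan, assignment), where assignment maps each task name
    to (resource_kind, worker_id, start_time, end_time).
    """
    # Assumed definition: acceleration factor = cpu_time / gpu_time.
    # Sorted ascending, so CPU-friendly tasks come first and
    # GPU-friendly tasks come last.
    queue = sorted(tasks, key=lambda t: t[1] / t[2])
    lo, hi = 0, len(queue) - 1

    # Min-heap of (time at which the worker becomes idle, worker id, kind).
    workers = [(0.0, i, "cpu") for i in range(n_cpus)]
    workers += [(0.0, n_cpus + j, "gpu") for j in range(n_gpus)]
    heapq.heapify(workers)

    assignment = {}
    makespan = 0.0
    while lo <= hi:
        free_at, wid, kind = heapq.heappop(workers)
        if kind == "gpu":
            # An idle GPU takes the task with the largest acceleration factor.
            name, _, duration = queue[hi]
            hi -= 1
        else:
            # An idle CPU takes the task with the smallest acceleration factor.
            name, duration, _ = queue[lo]
            lo += 1
        end = free_at + duration
        assignment[name] = (kind, wid, free_at, end)
        makespan = max(makespan, end)
        heapq.heappush(workers, (end, wid, kind))
    return makespan, assignment

# Hypothetical instance: (name, cpu_time, gpu_time).
example_tasks = [("a", 10, 1), ("b", 4, 2), ("c", 3, 3), ("d", 2, 4)]
makespan, assignment = heteroprio_schedule(example_tasks, n_cpus=1, n_gpus=1)
```

On this toy instance the GPU ends up with the two tasks it accelerates most ("a" and "b"), while the CPU keeps the tasks for which it is relatively better suited. Processing a single sorted queue from both ends is one simple way to express "each resource serves the tasks it is best at first"; the real strategy and its guarantees are those described in the cited papers.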

Another study [26] focuses on the memory-constrained case, where tasks may produce large data. A task can only be executed if all of its input and output data fit into memory, and a piece of data can only be removed from memory after the completion of the task that uses it as input. There is a known polynomial-time algorithm [55] that minimizes the peak memory used on one machine when the input graph is a rooted tree. In [26], we generalize this result to the variant where the input graph is a directed series-parallel graph, and propose a polynomial-time algorithm. This makes it possible to solve this practical problem for two important classes of applications.

In [13], we consider the static problem of data placement for matrix multiplication on heterogeneous machines, with the goal of optimizing both load balancing and communication volume. This is modeled as the partitioning of a square into a set of zones of prescribed areas, while minimizing the overall size of their projections onto the horizontal and vertical axes. We combine two ideas from the literature (recursive partitioning, and the structure of optimal solutions for a small number of processors) to obtain a non-rectangular recursive partitioning (NRRP), whose approximation ratio is $\frac{2}{\sqrt{3}} \simeq 1.15$, improving over the previous $1.25$ ratio. Moreover, we observe on a large set of realistic platforms built from CPUs and GPUs that the proposed NRRP algorithm achieves very efficient partitionings in all considered cases. In [14], we consider the generalization of this problem to the three-dimensional case. We prove the NP-completeness of the problem, and propose a generalization of NRRP with a ${(\frac{5}{6})}^{\frac{2}{3}}$ approximation ratio.
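The objective of the two-dimensional problem can be made concrete with a short Python sketch. It does not implement NRRP; it only evaluates the cost of a partition made of rectangular zones and compares it with the standard lower bound $2\sum_i \sqrt{s_i}$, which holds because a zone of area $s_i$ whose projections have lengths $w$ and $h$ satisfies $wh \geq s_i$, hence $w + h \geq 2\sqrt{s_i}$. The function names, the unit-square normalization and the example areas are assumptions made for illustration.

```python
import math

def projection_cost(zones):
    """Sum of the projection lengths of rectangular zones.

    zones: list of (width, height) pairs; a rectangle projects onto a
    segment of length `width` on the horizontal axis and `height` on the
    vertical axis.
    """
    return sum(w + h for w, h in zones)

def lower_bound(areas):
    """Lower bound on the cost of any partition with the given areas."""
    return 2 * sum(math.sqrt(a) for a in areas)

def column_partition(areas):
    """Naive baseline: cut the unit square into full-height vertical strips."""
    return [(a, 1.0) for a in areas]

# Hypothetical platform: one fast resource and two slower ones,
# with areas proportional to relative speeds (they sum to 1).
areas = [0.5, 0.25, 0.25]

strips = column_partition(areas)
# Alternative layout: left half for the fast resource, and the right half
# split into two squares for the slower ones.
mixed = [(0.5, 1.0), (0.5, 0.5), (0.5, 0.5)]

cost_strips = projection_cost(strips)
cost_mixed = projection_cost(mixed)
ratio_mixed = cost_mixed / lower_bound(areas)
```

Comparing `cost_strips` and `cost_mixed` with `lower_bound(areas)` shows how much the shape of the zones matters for the communication volume, even though both layouts give every resource exactly its prescribed area.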
